### DEGREE OF FLEXIBILITY... REGRESSION VS SMOOTH SPLINES ###

 

setwd("C:/Users/baron/627 Statistical Machine Learning/data") 

load("Auto.rda")

attach(Auto)

 

# Fit REGRESSION model predicting miles per gallon based on weight.

# Plot the regression line in red color with thickness=3

 

> reg = lm(mpg~weight)

> plot(weight,mpg)

> abline(reg,col="red",lwd=3)

 

 

# In general, abline(a,b) plots the line y = a + bx

 

# Fit a SPLINE with 2 degrees of freedom (straight line) to our data and plot it.

 

> spline2 = smooth.spline(weight,mpg,df=2)

> lines(spline2,col="blue")

 

# Add one more degree of freedom -> quadratic spline

 

> spline3 = smooth.spline(weight,mpg,df=3)

> lines(spline3,col="green",lwd=4)

 

# Increase flexibility by adding more d.f.

 

> spline20 = smooth.spline(weight,mpg,df=20)

> lines(spline20,col="brown",lwd=4)

 

# Blow flexibility to 100 degrees of freedom.

# The resulting spline is heavily dependent on each data point.

# Its prediction power is very low.

> spline100 = smooth.spline(weight,mpg,df=100)

> lines(spline100,col="orange",lwd=4)

 

# While spline100 is very flexible, and it matches the data most closely, it would not be powerful for prediction. We’ll learn how to measure prediction accuracy with various cross-validation tools.

 

 

How to choose the optimal method? Cross-validation technique.

 

> n = length(mpg);       Z = sample(n,n/2);       attach(Auto[Z,]);         # This will be our training data

> ss5 = smooth.spline(weight, mpg, df=5)                                         # Fit the spline using training data only

 

> attach(Auto);

> Yhat = predict(ss5, x=weight)

> names(Yhat)                                                                                     # Notice: prediction consists of two parts – predictor x and

[1] "x" "y"                                                                                            # predicted response y. We can call them Yhat$x and Yhat$y.

 

> mean(( Yhat$y[-Z] - mpg[-Z] )^2)                                                     # Then, compute prediction mean-squared error on test data

[1] 17.76551                                                                                        # This is the cross-validation error for a spline with df=5

 

 

# Try many different splines and choose the one with the smallest prediction error.

 

> cv.err = rep(0,50);   

> for (p in 1:50){

+ attach(Auto[Z,]);  ss = smooth.spline(weight, mpg, df=1+p/10)     # Try DF = 1.1, 1.2, …, 6.0

+ attach(Auto);      Yhat = predict(ss, weight)                                     # DF must be > 1

+ cv.err[p] = mean( (Yhat$y[-Z] - mpg[-Z])^2 )

}

 

> df = 1+(1:50)/10

> plot(df,cv.err)

 

> which.min(cv.err)

[1] 20

> df[20]

[1] 3

 

# This cross-validation method chooses the spline with 3 d.f.